Breast invasive carcinoma (BRCA), is the most common malignant cancer affecting women and is the second leading cause of cancer death worldwide. This disease has more than 1,300,000 cases and 450,000 death each year around the world[3]. This disease is widely heterogeneous, having a large and diverse set of molecular, histological and clinical behaviors depending of the tumour. The Cancer Genome Atlas (TCGA) has comprehensively profiled this type of cancer in a patient cohort. Here we analyze the expression profiles of those patients, accessible in the form of a raw RNA-seq counts produced by Rahman et al. (2015) using a pipeline based on the R/Bioconductor software package Rsubread.
We start importing the raw table of counts.
library(SummarizedExperiment)
se <- readRDS(file.path("data/seBRCA.rds"))
se
class: RangedSummarizedExperiment
dim: 20115 1232
metadata(5): experimentData annotation cancerTypeCode
cancerTypeDescription objectCreationDate
assays(1): counts
rownames(20115): 1 2 ... 102724473 103091865
rowRanges metadata column names(3): symbol txlen txgc
colnames(1232): TCGA.3C.AAAU.01A.11R.A41B.07
TCGA.3C.AALI.01A.11R.A41B.07 ... TCGA.GI.A2C8.11A.22R.A16F.07
TCGA.GI.A2C9.11A.22R.A21T.07
colData names(549): type bcr_patient_uuid ...
lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total
Explore the column (phenotypic) data, which in this case corresponds to clinical variables, and their corresponding metadata.
dim(colData(se))
[1] 1232 549
colData(se)[1:5, 1:5]
DataFrame with 5 rows and 5 columns
type bcr_patient_uuid
<factor> <factor>
TCGA.3C.AAAU.01A.11R.A41B.07 tumor 6E7D5EC6-A469-467C-B748-237353C23416
TCGA.3C.AALI.01A.11R.A41B.07 tumor 55262FCB-1B01-4480-B322-36570430C917
TCGA.3C.AALJ.01A.31R.A41B.07 tumor 427D0648-3F77-4FFC-B52C-89855426D647
TCGA.3C.AALK.01A.11R.A41B.07 tumor C31900A4-5DCD-4022-97AC-638E86E889E4
TCGA.4H.AAAK.01A.12R.A41B.07 tumor 6623FC5E-00BE-4476-967A-CBD55F676EA6
bcr_patient_barcode form_completion_date
<factor> <factor>
TCGA.3C.AAAU.01A.11R.A41B.07 TCGA-3C-AAAU 2014-1-13
TCGA.3C.AALI.01A.11R.A41B.07 TCGA-3C-AALI 2014-7-28
TCGA.3C.AALJ.01A.31R.A41B.07 TCGA-3C-AALJ 2014-7-28
TCGA.3C.AALK.01A.11R.A41B.07 TCGA-3C-AALK 2014-7-28
TCGA.4H.AAAK.01A.12R.A41B.07 TCGA-4H-AAAK 2014-11-13
prospective_collection
<factor>
TCGA.3C.AAAU.01A.11R.A41B.07 NO
TCGA.3C.AALI.01A.11R.A41B.07 NO
TCGA.3C.AALJ.01A.31R.A41B.07 NO
TCGA.3C.AALK.01A.11R.A41B.07 NO
TCGA.4H.AAAK.01A.12R.A41B.07 YES
mcols(colData(se), use.names=TRUE)
DataFrame with 549 rows and 2 columns
labelDescription
<character>
type sample type (tumor/normal)
bcr_patient_uuid bcr patient uuid
bcr_patient_barcode bcr patient barcode
form_completion_date form completion date
prospective_collection tissue prospective collection indicator
... ...
lymph_nodes_pelvic_pos_total total pelv lnp
lymph_nodes_aortic_examined_count total aor lnr
lymph_nodes_aortic_pos_by_he aln pos light micro
lymph_nodes_aortic_pos_by_ihc aln pos ihc
lymph_nodes_aortic_pos_total total aor-lnp
CDEID
<character>
type NA
bcr_patient_uuid NA
bcr_patient_barcode 2673794
form_completion_date NA
prospective_collection 3088492
... ...
lymph_nodes_pelvic_pos_total 3151828
lymph_nodes_aortic_examined_count 3104460
lymph_nodes_aortic_pos_by_he 3151832
lymph_nodes_aortic_pos_by_ihc 3151831
lymph_nodes_aortic_pos_total 3151827
These metadata consists of two columns of information about the clinical variables.
One called labelDescription contains a succinct description of the variable, often
not more self-explanatory than the variable name itself, and the other called
'CDEID' corresponds to the so-called Common Data Element (CDE) identifier. This
identifier can be use in https://cdebrowser.nci.nih.gov to search for further
information about the associated clinical variable using the Advanced search
form and the Public ID attribute search.
Now, explore the row (feature) data.
rowRanges(se)
GRanges object with 20115 ranges and 3 metadata columns:
seqnames ranges strand | symbol txlen
<Rle> <IRanges> <Rle> | <character> <integer>
1 chr19 [58345178, 58362751] - | A1BG 3322
2 chr12 [ 9067664, 9116229] - | A2M 4844
9 chr8 [18170477, 18223689] + | NAT1 2280
10 chr8 [18391245, 18401218] + | NAT2 1322
12 chr14 [94592058, 94624646] + | SERPINA3 3067
... ... ... ... ... ... ...
100996331 chr15 [20835372, 21877298] - | POTEB 1706
101340251 chr17 [40027542, 40027645] - | SNORD124 104
101340252 chr9 [33934296, 33934376] - | SNORD121B 81
102724473 chrX [49303669, 49319844] + | GAGE10 538
103091865 chr21 [39313935, 39314962] + | BRWD1-IT2 1028
txgc
<numeric>
1 0.5644190
2 0.4882329
9 0.3942982
10 0.3895613
12 0.5249429
... ...
100996331 0.4308324
101340251 0.4903846
101340252 0.4074074
102724473 0.5055762
103091865 0.5924125
-------
seqinfo: 455 sequences (1 circular) from hg38 genome
This feature data provides information about the genes, their symbols, location in the genome, length and gc content.
We focus on a subset of our data to perform the analysis. To create this subset we use a paired data criterion. That is, we keep only the data corresponding to patient that have samples of tumor and normal type. We use this criterion in order the minimize the genomic variability and supposedly reduce the probability of having confounding factors in our data. The following code performs this filtering:
codesnorm <- colData(se)[colData(se)$type == "normal",]$bcr_patient_barcode
codestum <- colData(se)[colData(se)$type == "tumor",]$bcr_patient_barcode
commoncodes <- codesnorm[codesnorm %in% codestum]
se <- se[,colData(se)$bcr_patient_barcode %in% commoncodes]
dim(se)
[1] 20115 212
table(se$type)
normal tumor
106 106
As we can see our filtered data-set consists now of 20115 genes for 212 samples, half of which are tumor type and the other half are normal samples.
To perform quality assessment and normalization we need first to load the edgeR R/Bioconductor package and create a DGEList' object.
library(edgeR)
names(se) <- rowRanges(se)$symbol
dge <- DGEList(counts=assays(se)$counts, genes=mcols(se))
Warning in as.data.frame.DataFrame(genes, stringsAsFactors = FALSE):
Arguments in '...' ignored
Now calculate $\log_2$ CPM values of expression and put them as an additional assay element to ease their manipulation.
assays(se)$logCPM <- cpm(dge, log=TRUE, prior.count=0.5)
assays(se)$logCPM[1:5, 1:5]
TCGA.A7.A0CE.01A.11R.A00Z.07 TCGA.A7.A0CH.01A.21R.A00Z.07
A1BG 1.119416 2.494770
A2M 9.200341 8.459248
NAT1 -6.573341 -6.573341
NAT2 -6.573341 -6.573341
SERPINA3 5.657967 5.022802
TCGA.A7.A0D9.01A.31R.A056.07 TCGA.A7.A13F.01A.11R.A12P.07
A1BG 2.139362 3.796120
A2M 9.462463 9.682542
NAT1 -6.573341 -6.573341
NAT2 -6.573341 -6.573341
SERPINA3 5.812733 4.961102
TCGA.AC.A23H.01A.11R.A157.07
A1BG -0.5202016
A2M 8.1727633
NAT1 -6.5733408
NAT2 -6.5733408
SERPINA3 5.6981573
Let's examine the library sizes in terms of total number of sequence read counts per sample. Figure S1 below shows library sizes per sample in increasing order.
Figure S1: Library sizes in increasing order.
This figure reveals substantial differences in sequencing depth between samples, although this differences are quite homogeneously distributed between normal and tumor samples. We might filter the samples that have a library size with an extreme value, although that would disrupt our paired subset, so we decided to leave all samples.
Let's look at the distribution of expression values per sample in terms of logarithmic CPM units. Due to the large number of samples, we display tumor and normal samples separately, and are shown in Figure S2.
Figure S2: Non-parametric density distribution of expression profiles per sample.
In the figure we do not appreciate remarkable differences between the distribution of tumor samples and normal samples.Let's calculate now the average expression per gene through all the samples. Figure S3 shows the distribution of those values across genes.
Figure S3: Distribution of average expression level per gene.
In the light of this plot, we may consider a cutoff of 1 log CPM unit as minimum value of expression to select genes being expressed across samples. Using this cutoff we proceed to filter out lowly-expressed genes.
mask <- avgexp > 1
dim(se)
[1] 20115 212
dim(dge)
[1] 20115 212
se <- se[mask, ]
dge <- dge[mask, ]
dim(se)
[1] 11855 212
dim(dge)
[1] 11855 212
We calculate now the normalization factors on the filtered expression data set.
dge <- calcNormFactors(dge)
Replace the raw log2 CPM units in the corresponding assay element of the SummarizedExperiment
object, by the normalized ones.
assays(se)$logCPM <- cpm(dge, log=TRUE, prior.count=0.5)
We examine now the MA-plots of the normalized expression profiles. We look first to
the tumor samples.
Figure S4: MA-plots of the tumor samples.
We do not observe samples with major expression-level dependent biases. Let's look now to the normal samples.Figure S5: MA-plots of the normal samples.
Again, we do not observe either important expression-level dependent biases among the normal samples.
We will search now for potential surrogate of batch effect indicators. Given that each sample names corresponds to a TCGA barcode (see https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode), following the strategy described in http://bioinformatics.mdanderson.org/main/TCGABatchEffects:Overview we are going to derive different elements of the TCGA barcode and examine their distribution across samples.
tss <- substr(colnames(se), 6, 7)
table(tss)
tss
A7 AC BH E2 E9 GI
8 8 142 20 30 4
center <- substr(colnames(se), 27, 28)
table(center)
center
07
212
plate <- substr(colnames(se), 22, 25)
table(plate)
plate
A00Z A056 A084 A089 A115 A12D A12P A137 A13Q A144 A14D A14M A157 A169 A16F
5 14 3 20 10 30 26 14 26 12 10 6 18 6 2
A17B A19E A19W A21T A466
4 1 2 2 1
portionanalyte <- substr(colnames(se), 18, 20)
table(portionanalyte)
portionanalyte
11R 12R 13R 21R 22R 23R 24R 31R 32R 33R 34R 41R 42R 43R 51R 52R 53R 61R
79 13 9 23 15 12 1 10 10 10 4 3 9 5 1 1 2 1
62R 71R 73R 94R
1 1 1 1
samplevial <- substr(colnames(se), 14, 16)
table(samplevial)
samplevial
01A 01B 11A 11B
105 1 93 13
From this information we can make the following observations:
All samples were sequenced at the same center
All samples belong to one of two combinations of tissue type and vial, matching the expected tumor and normal distribution, vials 01A and 01B correspond to tumor samples and 11A and 11B correspond to normal samples.
Samples were collected across different tissue source sites (TSS).
We are going to use the TSS as surrogate of batch effect indicator. Considering our outcome of interest as molecular changes between sample types, tumor vs. normal, we will examine now the cross-classification of this outcome with TSS.
table(data.frame(TYPE=se$type, TSS=tss))
TSS
TYPE A7 AC BH E2 E9 GI
normal 4 4 71 10 15 2
tumor 4 4 71 10 15 2
We see that the samples of each type are perfectly balanced over the TSS. We examine now how samples group together by hierarchical clustering and multidimensional scaling, annotating the outcome of interest and the the surrogate of batch indicator. We calculate again log CPM values with a higher prior count to moderate extreme fold-changes produced by low counts. The resulting dendrogram is shown in Figure S6.
logCPM <- cpm(dge, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(tss))
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(se)
outcome <- paste(substr(colnames(se), 9, 12), as.character(se$type), sep="-")
names(outcome) <- colnames(se)
sampleDendrogram <- dendrapply(sampleDendrogram,
function(x, batch, labels) {
if (is.leaf(x)) {
attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
attr(x, "label") <- as.vector(labels[attr(x, "label")])
}
x
}, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples",cex.lab=1.5, cex.axis=1.5, cex.main=1.5)
legend("topright", paste(levels(factor(tss))), fill=sort(unique(batch)))
Figure S6: Hierarchical clustering of the samples.
We can observe that samples cluster primarily by sample type, tumor or normal. TSS seems to have a stronger effect among the normal samples, while it distributes better among the tumor samples. We may consider discarding samples that do not seem to cluster as well across batches.
In Figure S7 we show the corresponding MDS plot. Here we see more clearly that the first source of variation separates tumor from normal samples. We can also observe that a few normal samples are separated from the rest, in a more clear separation than that shown in the hierarchical clustering. We may consider discarding those few samples and doing the MDS plot again to have a closer look to the differences among the rest of the samples and their relationship with TSS. In both figures S6 and S7 there a few samples that seem to deviate, but it is not a major deviation, so we decide to not exclude and continue to explore our data. Figures S6 and S7 are created by using TSS as surrogate variable, but we have done the same using the plate, analyte and vial as surrogate variables one at a time and have found similar results, since those plots do not contribute with relevant insight, they are omitted from this report.
plotMDS(dge, labels=outcome, col=batch,cex.lab=1.5, cex.axis=1.5, cex.main=1.5)
legend("bottomleft", paste(levels(factor(tss))),
fill=sort(unique(batch)), inset=0.05, horiz=TRUE, cex=1.3)
Figure S7: Multidimensional scaling plot of the samples.
We perform a simple examination of expression changes and their associated p-values using the R/Bioconductor package sva.
library(sva)
mod <- model.matrix(~ se$type, colData(se))
mod0 <- model.matrix(~ 1, colData(se))
pv <- f.pvalue(assays(se)$logCPM, mod, mod0)
sum(p.adjust(pv, method="fdr") < 0.01)
[1] 8393
There are 8393 genes changing significantly their expression at FDR < 1%. In Figure S8 below we show the distribution of the resulting p-values.
Figure S8: Distribution of raw p-values for an F-test on every gene between tumor and normal samples.
We see how a vast majority of our genes have a low p-value, to check the distribution of the rest of genes we will plot the histogram of the not significantly expressed genes, this histogram should correspond more or less to a uniform distribution.Figure S9: Distribution of p-values which are not significant.
The p-values are not perfectly uniform, although it does not seem too bad. Now, let's estimate surrogate variables using the sva() function.
sv <- sva(assays(se)$logCPM, mod, mod0)
Number of significant surrogate variables is: 27
Iteration (out of 5 ):1 2 3 4 5
sv$n
[1] 27
The SVA algorithm has found 27 surrogate variables. Let's use them to assess again the extent of differential expression this time adjusting for these surrogate variables.
modsv <- cbind(mod, sv$sv)
mod0sv <- cbind(mod0, sv$sv)
pvsv <- f.pvalue(assays(se)$logCPM, modsv, mod0sv)
sum(p.adjust(pvsv, method="fdr") < 0.01)
[1] 9450
We have increased the number of changing genes to 9450. Figure S10 shows the resulting distribution of p-values.
Figure S10: Distribution of raw p-values for an F-test on every gene between tumor and normal samples, adjusting for surrogate variables estimated with SVA.
Once again, we see how a vast majority of our genes have a low p-value, to check the distribution of the rest of genes we will plot the histogram of the not significantly expressed genes.
Figure S11: Distribution of p-values which are not significant after adjusting for surrogate variables estimated with SVA.
The p-values are not perfectly uniform, but the distribution seem a bit more uniform after adjusting.
The different QA diagnostics reveal some potentially problematic features in some of the samples. Some of this feature may be the difference in library size or variability that is not due to the sample type, as indicated by the MDS plot. However, these issues do not seem overly important, so we have decided to proceed with the analysis without eliminating any more samples after sub-setting, although we may consider discarding them in the future.
The main source of variation in this data seems to be driven by the tumor and normal condition of the samples. We have found no batch effect using the information from the TGCA barcode.
The extent of expression changes can be augmented when adjusting for surrogate variables estimated with SVA. Furthermore, the p-values that are not significant are distributed reasonably uniformly, and the distribution seems to improve after the SVA. Additionally, it would be interesting to observe how that extent changes when discarding potentially problematic samples.
sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 23 (Workstation Edition)
locale:
[1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C
[3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8
[5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8
[7] LC_PAPER=ca_ES.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] BiocInstaller_1.20.1 sva_3.18.0
[3] genefilter_1.52.1 mgcv_1.8-12
[5] nlme_3.1-127 geneplotter_1.48.0
[7] annotate_1.48.0 XML_3.98-1.4
[9] AnnotationDbi_1.32.3 lattice_0.20-33
[11] markdown_0.7.7 knitr_1.12.3
[13] SummarizedExperiment_1.0.2 Biobase_2.30.0
[15] GenomicRanges_1.22.4 GenomeInfoDb_1.6.3
[17] IRanges_2.4.8 S4Vectors_0.8.11
[19] BiocGenerics_0.16.1 edgeR_3.12.1
[21] limma_3.26.9
loaded via a namespace (and not attached):
[1] XVector_0.10.0 magrittr_1.5 splines_3.2.3
[4] zlibbioc_1.16.0 xtable_1.8-2 stringr_1.0.0
[7] highr_0.5.1 tools_3.2.3 grid_3.2.3
[10] KernSmooth_2.23-15 DBI_0.4 survival_2.39-2
[13] digest_0.6.9 Matrix_1.2-6 RColorBrewer_1.1-2
[16] formatR_1.3 codetools_0.2-14 mime_0.4
[19] evaluate_0.9 RSQLite_1.0.0 stringi_1.0-1